Machine Learning Site Reliability
IDC estimates the direct Internet of Things (IoT) market will grow to more than $1.7 trillion by 2020, with a compound annual growth rate (CAGR) of 16.9%. The number of connected IoT devices, in everything from cars to refrigerators and elsewhere, will climb to more than 30 billion. Device endpoints, infrastructure support, connectivity, and companion IT services are expected to account for the majority of spending. Moor Insights & Strategy (MI&S) believes the revenue will split into roughly three groupings of about $500B each. While the number of actual instances will vary wildly, endpoint devices will account for about 1/3 of the spending, with the remaining 2/3 split between purpose-built platforms, storage, networking (especially 5G mobile, which is not included in this segmentation), and security, and finally application software and all the service offerings that will be required.
Much of the industry discussion has been around the endpoint devices. In the general case, there will be two broad classes: sensors, which will predominantly gather data (example: monitoring environmental conditions or operations), and what we are calling “kinetic” devices, which will be capable of doing work (examples: alarms, locks, valve actuators). By 2020, these devices will generate a staggering amount of data at the edge, the area where the endpoints operate. Decisions about what data to keep, what to ignore, and what to forward to a centralized authority will be required. Many of the kinetic devices will be used in applications whose actions can neither tolerate long latency nor risk the possibility that the connection with the centralized authority (“the cloud”) is unavailable. Their decisions must be made instantly with local information and knowledge. Most IoT endpoints will be limited in capability due to size, cost, and power requirements and will need companion computing that is either embedded in the larger system or in a companion gateway. These gateways will primarily bridge between the local device communication domains and higher-level network domains and will in most cases make behavioral decisions. As the industry matures, these gateways will also be responsible for allowing data to be exchanged between intended devices and for ensuring the information is protected. Network traffic patterns will be significantly impacted as more device-to-endpoint traffic occurs and more machine-to-machine communication materializes, shifting from today’s patterns.
However, these solutions will not be static, and their evolving behavior will need to vary depending on local characteristics, giving rise to more software-defined functions at both the edge and within the datacenter. Further, their numbers will be vast, and their operation cannot require human intervention.
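A gateway's keep / ignore / forward decision can be sketched in a few lines. This is a minimal illustration, not a real gateway stack; the `Reading` type and the `NOISE_FLOOR` / `ALERT_LEVEL` thresholds are hypothetical names invented here for the example.

```python
from dataclasses import dataclass

@dataclass
class Reading:
    sensor_id: str
    value: float

# Hypothetical thresholds, chosen only for illustration.
NOISE_FLOOR = 0.5    # below this, the reading is not worth storing
ALERT_LEVEL = 10.0   # at or above this, the central authority should see it

def triage(reading: Reading) -> str:
    """Classify a sensor reading: 'ignore', 'keep' locally, or 'forward'."""
    if abs(reading.value) < NOISE_FLOOR:
        return "ignore"    # discard at the edge
    if reading.value >= ALERT_LEVEL:
        return "forward"   # send upstream to the cloud
    return "keep"          # retain locally for trend analysis

decisions = [triage(Reading("valve-7", v)) for v in (0.1, 3.2, 12.5)]
print(decisions)  # ['ignore', 'keep', 'forward']
```

The point of the sketch is that the decision runs entirely on local information, so it still works when the connection to the cloud is unavailable.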
Enter the need for automation and intelligence…machine learning.
Machine learning is defined as the ability of a machine to vary the outcome of a situation or behavior based on knowledge or observation, which is essential for IoT solutions. Interestingly, the knowledge can come in a variety of forms and does not necessarily need to be created locally. In other words, knowledge created in one place can be exported and used in many other locations: “train one and you train them all”. An example is threat management and protection, which will need constantly evolving knowledge and learning.
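The “train one and you train them all” idea can be made concrete with a toy example: fit a model at one site, export its parameters rather than its raw data, and let any other site apply it locally. The anomaly-threshold model below is a deliberately simple stand-in for whatever is actually learned; all names are hypothetical.

```python
import json

# "Train one": learn a simple anomaly model from data gathered at site A.
site_a_samples = [1.0, 1.2, 0.9, 1.1, 1.05]
mean = sum(site_a_samples) / len(site_a_samples)
var = sum((x - mean) ** 2 for x in site_a_samples) / len(site_a_samples)
model = {"mean": mean, "std": var ** 0.5}

# Export the learned knowledge, not the raw data.
payload = json.dumps(model)

# "...train them all": any other site imports the payload and applies it.
def is_anomalous(x: float, exported: str, k: float = 3.0) -> bool:
    m = json.loads(exported)
    return abs(x - m["mean"]) > k * m["std"]

print(is_anomalous(1.04, payload))  # False: an ordinary reading
print(is_anomalous(5.0, payload))   # True: flagged as anomalous
```

In a threat-management setting, the exported payload would be a detection model that is continually retrained centrally and redistributed to every edge site.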
There are two forms of knowledge: 1. observed knowledge, which modifies behavior based on local learning (usually referred to as training), and 2. directed knowledge, where knowledge created elsewhere (by a central authority) is used to modify edge behavior. In the fullness of this notion, you can look at this as machine learning (with a small “m.l.”), where edge behavior is modified, and Machine Learning (with a capital “M.L.”), where global trends are observed and policies that provide control are set. M.L. also has a larger role as the source of directed learning to modify behavior at the edge. By 2020, MI&S believes that this “machine learning and Machine Learning” arrangement will exist in a large number of solutions and will account for a great deal of the innovation in the IoT world. Clearly IBM believes that, as they have created the Watson Machine Learning Internet of Things, as does Google with the creation of TensorFlow (at least for the machine learning part). So, hold on to your hats! It is going to be a wild ride.
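The two forms of knowledge can be sketched as two update paths into the same edge agent: a local one driven by observed feedback, and a central one that pushes policy down. This `EdgeAgent` class and its parameters are invented for illustration; the point is only the shape of the two paths.

```python
class EdgeAgent:
    """Toy agent combining observed (local) and directed (central) knowledge."""

    def __init__(self, threshold: float = 1.0, lr: float = 0.1):
        self.threshold = threshold  # the behavioral parameter being learned
        self.lr = lr                # local learning rate

    def observe(self, reading: float, was_false_alarm: bool) -> None:
        # Observed knowledge ("m.l."): nudge the threshold from local feedback.
        if was_false_alarm:
            self.threshold += self.lr * (reading - self.threshold)

    def direct(self, policy: dict) -> None:
        # Directed knowledge ("M.L."): a central authority overrides parameters.
        self.threshold = policy.get("threshold", self.threshold)

    def alarm(self, reading: float) -> bool:
        return reading > self.threshold

agent = EdgeAgent()
agent.observe(1.5, was_false_alarm=True)  # local training raises the bar slightly
agent.direct({"threshold": 2.0})          # policy pushed down from the cloud
print(agent.alarm(1.8))                   # False after the directed update
```

The directed path wins here by design: the central authority sets policy, while local observation only tunes behavior between policy pushes.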
Disclaimer: The views and opinions represented in this work are mine and mine alone. They do not knowingly represent the views of any other individual or individuals living, dead, or otherwise. I make NO guarantee as to their accuracy NOR are they necessarily an accurate predictor of the future. They are the product of a partially lucid mind resulting from nearly four decades of fun in the information technology industry and the long term impact of said chaos. You should read them, possibly understand them, and immediately discard them…Other than that, I hope you find them useful.
Further Disclosure: My firm, Moor Insights & Strategy, like all research and analyst firms, provides or has provided research, analysis, advising, and / or consulting to many high-tech companies in the industry including some of the companies mentioned in this document. I do not knowingly hold any direct equity positions in those cited, but may have positions in retirement funds managed by a third party.
How can we build systems that will perform well in the presence of novel, even adversarial, inputs? What techniques will let us safely build and deploy autonomous systems at a scale where human monitoring becomes difficult or infeasible? Answering these questions is critical to guaranteeing the safety of emerging high-stakes applications of AI, such as self-driving cars and automated surgical assistants.
This workshop will bring together researchers in areas such as human-robot interaction, security, causal inference, and multi-agent systems in order to strengthen the field of reliability engineering for machine learning systems. We are interested in approaches that have the potential to provide assurances of reliability, especially as systems scale in autonomy and complexity.
We will focus on five aspects (robustness, awareness, adaptation, value learning, and monitoring) that can aid us in designing and deploying reliable machine learning systems. Some possible questions touching on each of these categories are given below, though we also welcome submissions that do not directly fit into these categories.
- Robustness: How can we make a system robust to novel or potentially adversarial inputs? What are ways of handling model mis-specification or corrupted training data? What can be done if the training data is potentially a function of system behavior or of other agents in the environment (e.g. when collecting data on users that respond to changes in the system and might also behave strategically)?
- Awareness: How do we make a system aware of its environment and of its own limitations, so that it can recognize and signal when it is no longer able to make reliable predictions or decisions? Can it successfully identify “strange” inputs or situations and take appropriately conservative actions? How can it detect when changes in the environment have occurred that require re-training? How can it detect that its model might be mis-specified or poorly-calibrated?
- Adaptation: How can machine learning systems detect and adapt to changes in their environment, especially large changes (e.g. low overlap between train and test distributions, poor initial model assumptions, or shifts in the underlying prediction function)? How should an autonomous agent act when confronting radically new contexts?
- Value Learning: For systems with complex desiderata, how can we learn a value function that captures and balances all relevant considerations? How should a system act given uncertainty about its value function? Can we make sure that a system reflects the values of the humans who use it?
- Monitoring: How can we monitor large-scale systems in order to judge if they are performing well? If things go wrong, what tools can help?
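One of the simplest mechanisms touching on the awareness question above is a classifier that abstains when its confidence is low, taking a conservative action (deferring to a human) instead of guessing on a "strange" input. This is a minimal sketch, assuming softmax confidence as the awareness signal and a hypothetical 0.8 cutoff; real systems use far richer out-of-distribution signals.

```python
import math

def softmax(scores):
    """Convert raw scores into a probability distribution."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_or_abstain(scores, labels, min_confidence=0.8):
    """Return a label only when the model is confident; otherwise abstain.

    Low softmax confidence is treated as a signal that the input may be
    outside the model's competence, triggering the conservative action.
    """
    probs = softmax(scores)
    best = max(range(len(probs)), key=lambda i: probs[i])
    if probs[best] < min_confidence:
        return "abstain"
    return labels[best]

labels = ["stop", "go"]
print(predict_or_abstain([4.0, 0.5], labels))  # confident -> 'stop'
print(predict_or_abstain([1.1, 1.0], labels))  # ambiguous -> 'abstain'
```

Softmax confidence alone is known to be poorly calibrated on truly novel inputs, which is exactly why the awareness and monitoring questions above remain open research problems rather than solved engineering.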