Monitoring sucks, don’t make it worse

You don’t have to go too far to find someone who thinks monitoring sucks. It’s definitely true that monitoring can be big, ugly, and complicated. I’m convinced that many of the problems in monitoring are not technical, but policy issues. For the sake of clarity (and because I’m like that), let’s start with some definitions. These definitions may or may not have validity outside the scope of this post, but at least they will serve to clarify what I mean when I say things.

  • Monitoring – an automatic process to collect metrics on a system or service
  • Alerting – notification when a critical threshold has been reached

In the rest of this post, I will be throwing some former colleagues under the bus. It’s not personal, and I’m responsible for some of the problem as well. The group in question has a monitoring setup that is dysfunctional to the point of being worthless. Not all of the problems are policy-related, but enough are to prompt this post. It should be noted that I’m not an expert on this subject, just a guy with opinions and a blog.

Perhaps the most important thing that can be done when setting up a monitoring system is coming up with a plan. It sounds obvious, but if you don’t know what you’re monitoring, why you’re monitoring it, and how you’re monitoring it, you’re bound to get it wrong. This is my first rule: in monitoring, failing to plan is planning to not notice failure.

It’s important to distinguish between monitoring and alerting. You can’t alert on what you don’t monitor, but you don’t need to alert on everything you monitor. This is one area where it’s easy to shoot yourself in the foot, especially at a large scale. In the group I mentioned, many of the monitoring checks were added in reaction to something going wrong. As a result, Nagios ended up alerting for things like “a compute node has 95% memory utilization.” For servers, that’s important. For compute nodes, who cares? The point of the machines is to do computation. Sometimes that means chewing up memory.

Which brings me to rule number two: every alert should have a reaction. If you’re not going to do something about an alert, why have it in the first place? It’s okay to monitor without alerting — the information can be important in diagnosing problems or analyzing usage — but if an alert doesn’t result in a human or automated reaction, shut it off.
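To make that concrete, here’s a sketch of what monitoring-without-alerting looks like in Nagios. The check command, thresholds, and host name are hypothetical, not from any real config:

    # Collect memory utilization on a compute node for trending and
    # diagnosis, but never notify anyone about it.
    # "check_mem" and "compute-node-42" are illustrative names.
    define service {
        use                     generic-service
        host_name               compute-node-42
        service_description     Memory Utilization
        check_command           check_mem!90!95   ; warn at 90%, critical at 95%
        notifications_enabled   0                 ; monitor, never alert
        process_perf_data       1                 ; keep the data for graphing
    }

The data still lands in the performance graphs, where it’s useful for diagnosing problems, but nobody gets paged because a compute node is doing its job.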

Along that same line, alerts should be a little bit painful. Don’t punish yourself for something failing, but don’t make alerts painless either. Perhaps the biggest problem in the aforementioned group is that most of the admins filtered Nagios messages away. That immediately killed any incentive to improve the setup.

I took the alternate approach and weakly lobbied for all alerts to hit the pager. This probably falls into the “too painful” category. You should use multiple levels of alerts. An email or ticket is fine for something that needs to be acted on but can wait until business hours. A more obnoxious form of alert should be used for the Really Important Things[tm].

The great thing about having a little bit of pain associated with alerts is that it also acts as an incentive to fix false alarms. At one point, I wrote Nagios checks to monitor HTCondor daemons. Unfortunately, due to the load on the Nagios server, the checks would time out and produce alerts. The daemons were fine, and the condor_master process generally does a good job of keeping things under control. So I removed the checks.

The opposite problem is running checks outside the monitoring system. One colleague had a series of cron jobs that checked the batch scheduler. If the checks failed, he would email the group. Don’t work outside the system.
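The fix is usually straightforward: the Nagios plugin contract is just an exit status plus a line of output, so the same logic from a cron job can be wrapped as a check. A minimal sketch (the daemon name is just an example):

    #!/bin/sh
    # Wrap a homegrown scheduler check as a Nagios plugin instead of
    # emailing the group from cron. Exit status follows the plugin
    # convention: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.
    if ! pgrep -x condor_schedd >/dev/null 2>&1; then
        echo "CRITICAL: condor_schedd is not running"
        exit 2
    fi
    echo "OK: condor_schedd is running"
    exit 0

Now the result shows up in the same dashboard, honors the same acknowledgements and downtime, and alerts through the same channels as everything else.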

Finally, be sure to consider planned outages. If you can’t suppress alerts when things are broken intentionally, you’re going to have a bad time. As my friend tweeted: “Rough estimates indicate we sent something like 180,000 emails when our clusters went down for maintenance.”
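In Nagios, downtime can be scheduled programmatically through the external command file, which makes it easy to bake alert suppression into the maintenance procedure itself. A sketch, with an example command-file path and host name (adjust for your installation):

    #!/bin/sh
    # Schedule four hours of fixed downtime for a host and all of its
    # services ahead of planned maintenance, using the documented
    # Nagios external command format.
    CMDFILE=/usr/local/nagios/var/rw/nagios.cmd
    HOST=compute-node-42
    NOW=$(date +%s)
    END=$((NOW + 4 * 3600))

    printf '[%s] SCHEDULE_HOST_DOWNTIME;%s;%s;%s;1;0;14400;admin;planned maintenance\n' \
        "$NOW" "$HOST" "$NOW" "$END" >> "$CMDFILE"
    printf '[%s] SCHEDULE_HOST_SVC_DOWNTIME;%s;%s;%s;1;0;14400;admin;planned maintenance\n' \
        "$NOW" "$HOST" "$NOW" "$END" >> "$CMDFILE"

Run per-host (or wrapped in a loop over the cluster) at the start of the window, this keeps 180,000 we-know-it’s-broken emails from ever being sent.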

CERIAS Recap: Featured Commentary and Tech Talk #3

Once again, I’ve attended the CERIAS Security Symposium held on the campus of Purdue University. This is the final post summarizing the talks I attended.

I’m combining the last two talks into a single post. The first was fairly short, and by the time the second one rolled around, my brain was too tired to focus.

Thursday afternoon included a featured commentary from The Honorable Mark Weatherford, Deputy Undersecretary of Cybersecurity at the U.S. Department of Homeland Security. Mr. Weatherford was originally scheduled to speak at the Symposium, but restrictions in federal travel budgets forced him to present via pre-recorded video. Mr. Weatherford opened with an observation that “99% secure means 100% vulnerable.” There are many cases where a single failure in security resulted in compromise.

The cyber threat is real. DHS Secretary Napolitano says infrastructure is dangerously vulnerable to cyber attack. Banks and other financial institutions have been under sustained DDoS attack, and it has become very predictable. In the future, there will be more attacks, they will be more disruptive, and they will be harder to defend against.

So what does DHS do in this space? DHS provides operational protection for the .gov domain. They work with the .com sector to improve protection, especially of critical infrastructure. DHS responds to national events and works with other agencies to foster international cooperation.

Cybersecurity got two paragraphs in President Obama’s 2013 State of the Union address. Obama’s recent cybersecurity executive order has the goals of establishing an up-to-date cybersecurity network and enhancing information sharing among key stakeholders. DHS is involved in the Scholarship for Service program, which works to develop professionals to meet current and future needs.

The final session was a tech talk by Stephen Elliott, Associate Professor of Technology Leadership and Innovation at Purdue University, entitled “What is missing in biometric testing.” Traditional biometric testing is algorithmic, with well-established metrics and methodologies. Operational testing is harder to do because test methodologies are sometimes dependent on the test. Many papers have been written about the contribution of individual error to performance. Some papers have been written on the contribution of metadata error. Elliott is focused on training: how users get accustomed to devices, how they remember how to use them, and how training can be provided to users with a consistent message.

One way to improve biometrics is understanding the stability of the user’s response. If we know how stable a subject is, we can reduce the transaction time by requiring fewer measurements. Many factors, including the user, the agent, and system usability, affect the performance of biometric systems. Improving performance is not a matter of simply improving the algorithms, but of improving the entire system.


CERIAS Recap: Panel #3

Once again, I’ve attended the CERIAS Security Symposium held on the campus of Purdue University. This is one of several posts summarizing the talks I attended.

The “E” in CERIAS stands for “Education”, so it comes as no surprise that the Symposium would have at least one event on the topic. On Thursday afternoon, a panel addressed issues in security education and training. I found this session particularly interesting because it paralleled many discussions I have had about education and training for system administrators.

Interestingly, the panel consisted entirely of academics. That’s not particularly a surprise, but it does bias the discussion toward higher education issues rather than vocational-type training. This is often a contentious issue in operations education discussions. I’m not sure if such a divide exists in the infosec world. Three Purdue professors sat on the panel: Allen Gray, Professor of Agriculture; Melissa Dark, Professor of Computer & Information Technology and Associate Director of Educational Programs at CERIAS; and Marcus Rogers, Professor of Computer & Information Technology. They were joined by Ray Davidson, Dean of Academic Affairs at the SANS Technology Institute; and Diana Burley, Associate Professor of Human and Organizational Learning at The George Washington University.

Professor Gray began the opening remarks by telling the audience he had no cyber security experience. His expertise is in distance learning, as he is the Director of a MS/MBA distance program in food and agribusiness management. The rise of MOOCs has made information more available than ever before, but Gray notes that merely providing the information is not education. The MS/MBA program offers a curriculum, not just a collection of courses, and requires interaction between students and instructors.

Dean Davidson is in charge of the master’s degree programs offered by the SANS Technology Institute. This is a new offering and they are still working on accreditation. Although it incorporates many of the SANS training courses, it goes beyond those. “The old days of protocol vulnerabilities are starting to go away, but people still need to know the basics,” he said. “Vulnerabilities are going up the stack. We’re at layers 9 and 10 now.” Students need training in legal issues and organizational dynamics in order to become truly effective practitioners.

Professor Dark joined CERIAS without any experience in providing cybersecurity education. In her opening remarks, she talked about the appropriate use of language: “We always talk about the war on defending ourselves, the war on blah. We’re not using the language right. We should reserve ‘professionalization’ for people who deal with a lot of uncertainty and a lot of complexity.” Professor Burley also discussed vocabulary. We need to consider who is the cybersecurity workforce. Most cybersecurity professionals are in hybrid roles, so it’s not appropriate to focus on the small number who have roles entirely focused on cybersecurity.

Professor Rogers drew parallels to other professions. Historically, professionals of any type have been developed through training, certification, education, apprenticeship, or some combination of those. In cybersecurity, all of these methods are used. Educators need to consider what a professional in the field should know, and there’s currently no clear-cut answer. How should education respond? “Better than we currently are.” Rogers advocates abandoning the stovepipe approach. Despite talk of being multidisciplinary, programs are often still very traditional. “We need to bring back apprenticeship and mentoring.”

The opening question addressed differences between education and training. Gray reiterated that disseminating information is not necessarily education; education is about changing behavior. Universities tend to focus on theory, but professionalization is about applying that theory. As the talk drifted toward certifications, which are often the result of training, Rogers said, “we’re facing the watering-down of certifications. If everybody has a certification, how valuable is it?” Dark launched a tangent when she observed that cybersecurity is in the same space as medicine: there’s so much that practitioners can’t know. This led to a distinction being made (by Spafford, if I recall correctly) between EMTs and brain surgeons as an analogy for various cybersecurity roles. Rogers said we need both. They are different professions, Burley noted, but they both consider themselves professionals.

One member of the audience said we have a great talent pool entering the work force, but they’re all working on the same problems. How many professionals do we need? Davidson said “we need to change the whole ecosystem.” When the barn is on fire, everyone’s a part of the bucket brigade; nobody has time to design a better barn or better firefighting equipment. Burley pointed out that the NSF’s funding of scholarships in cybersecurity is shifting toward broader areas, not just computer science. This point was reinforced by Spafford’s observation that none of the panelists have their terminal degree in computer science. “If we focus on the job openings that we have right now,” Rogers said, “we’re never going to catch up with the gaps in education.” One of the panelists, in regard to NSF and other efforts, said “you can’t rely on the government to be visionary. You might be able to get the government to fund vision,” but not set it.

The final question was “how do you ensure that ethical hackers do not become unethical hackers?” Rogers said “in education, we don’t just give you knowledge, we give you context to that knowledge.” Burley drew a parallel to the Hippocratic Oath and stressed the importance of socialization and culturalization processes. Davidson said the jobs have to be there as well. “If people get hungry, things change.”


CERIAS Recap: Fireside Chat

Once again, I’ve attended the CERIAS Security Symposium held on the campus of Purdue University. This is one of several posts summarizing the talks I attended.

The end of Christopher Painter’s talk transitioned nicely into the Fireside Chat with Painter and CERIAS Executive Director Gene Spafford. Spafford opened the discussion with a topic he tried to get the first panel to address: privacy. “Many people view security as the most important thing,” Spafford observed, which results in things like CISPA, which would allow unlimited and unaccountable sharing of data with the government. According to Painter, privacy and security “are not incompatible.” The Obama administration works to ensure civil liberty and privacy protections are built in. Painter also disagreed with Spafford’s assertion that the U.S. is behind Europe in privacy protection. The U.S. and the E.U. want interoperable privacy rules. They’re not going to be identical, but they should work together. Prosecution of cyber attacks, according to Painter, aids privacy in the long run.

An audience member wanted to know how to address the risks of attribution and proportional response now that cyber defense is transitioning from passive to active. Painter noted that vigilante justice is dangerous due to the possibility of misattribution and the risk of escalating the situation. “I don’t advocate a self-help approach,” he said.

Another audience member expressed concern with voluntary standards, observing that compliance is spotty even in regulated industries (e.g. health care). He wondered whether these voluntary international standards were intended to be guidance or to be effective. Painter said they are intended to set a “standard of care”. Governments will need to set incentives and mechanisms to foster compliance. Spafford pointed out that there are two types of standards: minimum standards and aspirational standards. Standards can also institutionalize bad behavior, so it is important to set the right standards.

Painter had earlier commented that progress has been structural. An audience member wondered where the gaps remain. The State Department, according to Painter, is a microcosm of the rest of the Executive Branch. Within State, they’ve done a good job of getting the parts of the agency working well together. They weren’t cooperating operationally as much as they could, but that’s improved, too. Spafford asked about state-level cooperation. 9/11 drove a great deal of state cooperation, and we’re now beginning to see states participate more in cyber efforts.

One member of the audience said “without accountability, you have no rule of law. How do you have accountability on the Internet?” Painter replied that there are two sides to the coin: prevention and response. Response is more difficult; there have been efforts by the FBI and others in the past few years to step up enforcement and response. Spafford pointed out that even if an attack has been traced to another country with good evidence, the local government will sometimes deny it. Can they be held accountable? We have to build the consensus that this is important, said Painter. If you’re outside that consensus, you will become isolated. A lot of countries in the developing world are still building capabilities. They want to stop cybercrime, but they can’t. Cybercrime is often used to facilitate traditional crime. That might be a lever to help encourage cooperation from other nations.

Fresh off this morning’s attack on North Korean social media accounts, the audience wanted to hear comments on Anonymous attacking governments. “If you’re doing something that’s a crime,” Painter said, “it’s a crime.” Improving attribution can help prevent or prosecute these attackers. The conversation moved to the classification of information when Spafford observed that some accuse governments of over-classifying information. Painter said that has not been his experience. When people reveal classified information, that damages a lot of efforts. We have to balance speech and protection. The openness of the Internet is key.

Two related questions were asked back to back. The first questioner observed that product manufacturers are good at externalizing the cost of insecurity and asked how producers can be incentivized to produce more secure products. The second question dealt with preventing misuse of technology, with The Onion Router being cited as an example of a program used for both good and bad. Painter said the market for security is increasing, with consumers becoming more willing to pay for security. Industry is looking at how to move security away from the end user in order to make it more transparent. Producers can’t tell how their work will be used, but even when technology is used to obscure attribution, there are other ways to trace criminals (for example, money trails).

One other question asked how we address punishment online. Painter said judges have discretion in sentencing and U.S. sentencing laws are “generally pretty rational.” The penalties in cyberspace are generally tied to the penalties for equivalent offenses in the physical world. In seeming contradiction, Spafford pointed out that almost everything in the Computer Fraud and Abuse Act is a felony and asked Painter if there is room for more misdemeanor offenses in federal law. Painter said there are misdemeanor offenses in state and local laws. Generally, Spafford said, policymakers need better understanding of tech, but tech people need better understanding of law.

There were other aspects of this discussion that I struggle to summarize (especially given the lengthy nature of this post). I do think this was the most interesting session of the entire symposium, at least for me. I’ve recently found my interest in law and policy increasing, and I lament the fact that I’ve nearly completed my master’s degree at this point. I actually caught myself thinking about a PhD this morning, which is an absolutely unnecessary idea at this stage in my life.


CERIAS Recap: Thursday keynote

Once again, I’ve attended the CERIAS Security Symposium held on the campus of Purdue University. This is one of several posts summarizing the talks I attended.

Thursday’s keynote address was delivered by Christopher Painter, the Coordinator for Cyber Issues at the U.S. State Department. Mister Painter has a long and distinguished career in law and policy, starting with the U.S. Attorney’s office in Los Angeles, and moving through several roles in the Justice Department. He served as acting Cyber Czar during his time in the White House, and finally ended up in the State Department.

Cyber security issues have started receiving increased attention in recent years. Painter said President Obama came to the White House with a unique understanding of security because his 2008 campaign was hacked. In his 2013 State of the Union address, Mr. Obama became the first president to address cyber security on such a stage.

As Todd Gebhart noted the morning before, the conversation has evolved from being purely technical to involving senior policy officials. This requires the technical community to work with the policy community so that policy is well-informed. Painter takes heart in observing senior officials discuss cyber security issues beyond the scope of their prepared notes.

Although the State Department has a role in responding to DoS attacks against diplomatic institutions, the primary focus seems to be on fostering international cooperation. The international nature of cyber crime makes it very difficult to combat. Many different targets and intents are involved, as well. Although there have not been any [publicly reported] terrorist attacks on critical infrastructure, the threat exists. There are financial motivations for other cyber crimes. For example, one man spoofed Bloomberg web pages to publish fake articles in order to manipulate the stock price of a company. Although he got cold feet about executing the trade, people lost money in their own trades.

Regardless of the specific incident, the international nature of cyber crime makes it difficult to pursue and prosecute offenders. Some governments are more interested in “regime security”, protecting the interests of their own authoritarian states. The goal of U.S. cyber policy is an open, secure, reliable Internet system. To accomplish this, the State Department is promoting a shared framework of existing norms grounded in existing international law. Larger embassies have created “cyber attaché” positions in order to help foster international cooperation.


CERIAS Recap: Panel #1

Once again, I’ve attended the CERIAS Security Symposium held on the campus of Purdue University. This is one of several posts summarizing the talks I attended. This post will also appear on the CERIAS Blog.

With “Big Data” being a hot topic in the information technology industry at large, it should come as no surprise that it is being employed as a security tool. To discuss the collection and analysis of data, a panel was assembled from industry and academia. Alok Chaturvedi, Professor of Management, and Samuel Liles, Associate Professor of Computer and Information Technology, both of Purdue University, represented academia. Industry representatives were Andrew Hunt, Information Security Researcher at the MITRE Corporation; Mamani Older, Citigroup’s Senior Vice President for Information Security; and Vincent Urias, a Principal Member of Technical Staff at Sandia National Laboratories. The panel was moderated by Joel Rasmus, the Director of Strategic Relations at CERIAS.

Professor Chaturvedi made the first opening remarks. His research focus is on reputation risk: the potential damage to an organization’s reputation – particularly in the financial sector. Reputation damage arises from the failure to meet the reasonable expectations of stakeholders and has six major components: customer perception, cyber security, ethical practices, human capital, financial performance, and regulatory compliance. In order to model risk, “lots and lots of data” must be collected; reputation drivers are checked daily. An analysis of the data showed that malware incidents can be an early warning sign of increased reputation risk, allowing organizations an opportunity to mitigate reputation damage.

Mister Hunt gave brief introductory comments. The MITRE Corporation learned early that good data design is necessary from the very beginning in order to properly handle a large amount of often-unstructured data. They take what they learn from data analysis and re-incorporate it into their automated processes in order to reduce the effort required by security analysts.

Mister Urias presented a less optimistic picture. He opened his remarks with the assertion that Big Data has not fulfilled its promise. Many ingestion engines exist to collect data, but the analysis of the data remains difficult. This is due in part to the increasing importance of meta characteristics of data. The rate of data production is challenging as well. Making real-time assertions from data flow at line rates is a daunting problem.

Ms. Older noted that Citigroup gets DDoS attacks every day, though some groups stage attacks on a somewhat predictable schedule. As a result, Citigroup employs a strong perimeter defense. She noted, probably hyperbolically, that it takes 20 minutes to boot her laptop. Despite the large volume of data produced by the perimeter defense tools, they don’t necessarily have good data on internal networks.

Professor Liles focused on the wealth of metrics available and how most of them are not useful. “For every meaningless metric,” he said, “I’ve lost a hair follicle. My beard may be in trouble.” It is important to focus on the meaningful metrics.

The first question posed to the panel was “if you’re running an organization, do you focus on measuring and analyzing, or mitigating?” Older said that historically, Citigroup has focused on defending perimeters, not analysis. With the rise of mobile devices, they have recognized that mere mitigation is no longer sufficient. The issue was put rather succinctly by Chaturvedi: “you have to decide if you want to invest in security or invest in recovery.”

How do organizations know if they’re collecting the right data? Hunt suggested collecting everything, but that’s not always an option, especially in resource-starved organizations. Understanding the difference between trend data and incident data is important, according to Liles, and you have to understand how you want to use the data. Organizations with an international presence face unique challenges since legal restrictions and requirements can vary from jurisdiction to jurisdiction.

Along the same lines, the audience wondered how long data should be kept. Legal requirements sometimes dictate how long data may be kept (either at a minimum or maximum) and what kind of data may be stored. The MITRE Corporation uses an algorithmic system to determine data retention periods and storage media. Liles noted that some organizations are under long-term attack, and sometimes the hardware refresh cycle is shorter than the duration of the attack. Awareness of what local log data is lost when a machine is discarded is important.

Because much of the discussion had focused on ways that Big Data has failed, the audience wanted to know of successes in data analytics. Hunt pointed to the automation of certain analysis tasks, freeing analysts to pursue more things faster. Sandia National Labs has been able to correlate events across systems and quantify sensitivity effects.

One audience member noted that as much as companies profess a love for Big Data, they often make minimal use of it. Older replied that it is industry-dependent. Where analysis drives revenue (e.g. in retail), it has seen heavier use. An increasing awareness of analysis in security will help drive future use.


CERIAS Recap: Opening Keynote

Once again, I’ve attended the CERIAS Security Symposium held on the campus of Purdue University. This is the first of several posts summarizing the talks I attended.

The opening keynote was delivered by Todd Gebhart, the co-president of McAfee, Inc. Mr. Gebhart opened by reminding the audience that a “certain individual” who happens to share a name with the company is no longer involved with the McAfee corporation. Gebhart set the stage by addressing why McAfee employees go to work every day. The company focuses on protecting four areas: personal, business, government, and critical infrastructure.

The nature of security has changed over the years. In 1997, updates to antivirus subscriptions were physically mailed on disk to McAfee customers every three months. 17,000 known pieces of malware had been identified. Today, a growth in the number of connected devices has spurred a growth in malware. McAfee estimates one billion devices are connected to the Internet today, a number which is forecast to grow to 50 billion by 2020. Despite improvements in security procedures and products, the rate of growth in malware does not appear to be slowing.

The growth rate is greatest for mobile devices, where “only” 36,000 unique pieces of malware are known to exist (according to a preliminary study, 4% of all mobile apps are designed with malicious intent). Consolidation of mobile operating systems into two main players (iOS and Android) has made it easier for malware writers. The nature of the threat on mobile has changed as well. Whereas desktop and server-based attacks were often about gaining control of or denying service to a machine, mobile threats are more focused on the loss of data and devices. The addition of WiFi, while of considerable benefit to users, has opened up a whole new realm of attack vectors that did not exist a few years ago.

Gebhart gave a brief survey of current malware threats in the four sectors listed above. He noted that attacks are no longer about machines; they’re about people and organizations. Accordingly, spam and botnets are becoming less of a concern in favor of malicious URLs. Behavior- and pattern-based attacks allow bad actors to focus their efforts more efficiently, and the development of Hacker-as-a-Service (HaaS) offerings allows for attackers with little-to-no technical knowledge.

The evolving threat has led to greater awareness among non-technical business leaders. Security companies are now having discussions not only with technical leadership in organizations, but also with high-level business and government leaders.

The industry is evolving to face the new and emerging threats. The use of real-time data to make real-time decisions can improve the response to attacks, or perhaps prevent them. Multi-organization cooperation can help defend against so-called “trial-and-error” attacks. Cloud-based threat intelligence allows McAfee to analyze malware traffic across 120 million devices worldwide. Hardware and software vendors are working together (or in the case of Intel, buying McAfee) to develop systems that can detect malware at the hardware interaction layer.

Gebhart closed by saying “it’s an exciting time to be in security” and noting that his company is always looking for talented security researchers and practitioners.
